Install a package from a different conda channel

Open a terminal and run the following commands:

conda activate <ENV_NAME>
conda install -c conda-forge pandas-profiling -y

We're installing pandas-profiling through this method because Anaconda's default channel contains an outdated version of this package, whereas the channel conda-forge has an updated version.

Context

The data used throughout the practical classes comes from a small relational database whose schema is shown below:

[Figure: database schema]

Reading the Data

Metadata

Initial Analysis

Pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

Pandas 10 min tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
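
A first look at a DataFrame usually combines a handful of pandas calls. The sketch below assumes a small toy DataFrame with hypothetical columns (age, income, education), since the course dataset is not reproduced here; swap in your own DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 41, 37, np.nan, 52],
    "income": [32000.0, 58000.0, 47000.0, 61000.0, np.nan],
    "education": ["BSc", "MSc", "BSc", "PhD", "MSc"],
})

print(df.head())    # first rows of the table
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column
df.info()           # non-null counts per column and memory usage

# Descriptive statistics for every column, transposed to one row per variable
summary = df.describe(include="all").T
print(summary)
```

Comparing the count column of the summary with the number of rows is a quick way to spot variables with missing values.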

Problems:

Take a closer look and point out possible problems:

(hint: in pandas, a missing value is represented by NaN)
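
As a possible starting point, the checks below count missing values, duplicated rows, and empty strings. The toy DataFrame and its columns are hypothetical and deliberately contain a few problems; this is a sketch, not a complete data-quality audit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberate problems (illustrative column names).
df = pd.DataFrame({
    "age": [25, 41, np.nan, 41, 230],
    "income": [32000.0, 58000.0, 47000.0, 58000.0, np.nan],
    "education": ["BSc", "MSc", "", "MSc", "PhD"],
})

missing_counts = df.isna().sum()             # NaN count per column
missing_share = df.isna().mean()             # fraction of NaN per column
n_duplicates = int(df.duplicated().sum())    # rows identical to an earlier row
n_empty = int((df["education"] == "").sum()) # empty strings are not counted as NaN

print(missing_counts)
print(missing_share)
print(n_duplicates, n_empty)
print(df["age"].max())                       # implausible extremes hint at data-entry errors
```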

Visual Exploration

Matplotlib tutorials: https://matplotlib.org/3.3.1/tutorials/index.html

Matplotlib gallery: https://matplotlib.org/3.3.1/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py

Seaborn tutorials: https://seaborn.pydata.org/tutorial.html

Seaborn gallery: https://seaborn.pydata.org/examples/index.html

More examples for visualizing distributions:

Pyplot-style vs Object-Oriented-style
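
The same plot can be written in both styles. The sketch below uses made-up data purely to contrast the two interfaces.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2, 50)

# Pyplot style: functions act on an implicit "current" figure and axes.
plt.figure(figsize=(5, 3))
plt.plot(x, x**2, label="quadratic")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Pyplot style")
plt.legend()

# Object-oriented style: create the Figure and Axes explicitly and call their methods.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, x**2, label="quadratic")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Object-oriented style")
ax.legend()
```

The object-oriented style tends to scale better once a figure has several subplots, because each Axes is addressed by name rather than through the implicit current axes.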

Numeric Variables' Univariate Distribution
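
One way to inspect every numeric variable at once is a grid with a histogram and a box plot per variable. The sketch below assumes synthetic data and a hypothetical list named metric_features; adapt both to your dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data; column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 200),
    "income": rng.lognormal(10, 0.5, 200),
    "spend": rng.exponential(300, 200),
})
metric_features = ["age", "income", "spend"]

# Top row: histograms. Bottom row: box plots. One column per variable.
fig, axes = plt.subplots(2, len(metric_features), figsize=(12, 6))
for col, ax_hist, ax_box in zip(metric_features, axes[0], axes[1]):
    ax_hist.hist(df[col], bins=20)
    ax_hist.set_title(col)
    sns.boxplot(x=df[col], ax=ax_box)

fig.suptitle("Numeric variables: histograms and box plots")
fig.tight_layout()
```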

What information can we extract from the plots above?

Insights:

Pairwise Relationship of Numerical Variables

Insights:

Example of Visualization formatting

Categorical/Low Cardinality Variables' Absolute Frequencies
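
Absolute frequencies of a single categorical variable can be computed with value_counts and drawn as a bar plot. The education column below is a hypothetical example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({
    "education": ["BSc", "MSc", "BSc", "PhD", "MSc", "BSc"],
})

counts = df["education"].value_counts()   # absolute frequency per category, sorted descending

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(counts.index, counts.values)
ax.set_xlabel("education")
ax.set_ylabel("count")
ax.set_title("Absolute frequencies")
```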

What information can we extract from the plot above?

Using the same logic from the multiple box plot figure above, build a multiple bar plot figure for each non-metric variable:
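
One possible layout, assuming a hypothetical list named non_metric_features and toy data, loops over the variables and draws one count plot per subplot:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical columns.
df = pd.DataFrame({
    "education": ["BSc", "MSc", "BSc", "PhD", "MSc", "BSc"],
    "gender": ["F", "M", "M", "F", "F", "F"],
    "region": ["North", "South", "North", "North", "South", "Center"],
})
non_metric_features = ["education", "gender", "region"]

# One subplot per variable, all in a single row
fig, axes = plt.subplots(1, len(non_metric_features), figsize=(12, 3))
for ax, col in zip(axes, non_metric_features):
    sns.countplot(data=df, x=col, ax=ax)
    ax.set_title(col)

fig.suptitle("Categorical variables: absolute frequencies")
fig.tight_layout()
```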

Insights:

Comparing two categorical variables
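
A contingency table is a common way to compare two categorical variables, and it can be drawn as a stacked bar plot. The columns below are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical columns.
df = pd.DataFrame({
    "education": ["BSc", "MSc", "BSc", "PhD", "MSc", "BSc"],
    "gender": ["F", "M", "M", "F", "F", "F"],
})

# Rows: education levels. Columns: gender. Cells: number of observations.
ct = pd.crosstab(df["education"], df["gender"])
print(ct)

fig, ax = plt.subplots(figsize=(5, 3))
ct.plot(kind="bar", stacked=True, ax=ax)
ax.set_ylabel("count")
ax.set_title("education by gender")
```

Passing normalize="index" to pd.crosstab gives row proportions instead of counts, which makes groups of different sizes easier to compare.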

Comparing a categorical variable vs continuous (or discrete) variables
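
A box plot per category, together with a grouped summary, is one way to compare a numeric variable across the levels of a categorical one. The columns below are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one categorical and one continuous column.
df = pd.DataFrame({
    "education": ["BSc", "MSc", "BSc", "PhD", "MSc", "BSc"],
    "income": [30000, 52000, 34000, 70000, 48000, 32000],
})

# Median of the numeric variable within each category
group_medians = df.groupby("education")["income"].median()
print(group_medians)

# Distribution of the numeric variable within each category
fig, ax = plt.subplots(figsize=(5, 3))
sns.boxplot(data=df, x="education", y="income", ax=ax)
ax.set_title("income by education")
```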

Explore categorical data vs continuous and discrete data

Another example of visualization. Although it is not a simple visualization to produce, it can be very informative.

Metric Variables' Correlation Matrix
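
A correlation matrix can be computed with DataFrame.corr and shown as an annotated heatmap. The sketch assumes Pearson correlation (the pandas default) and synthetic data with hypothetical column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data; income is built to correlate with age.
rng = np.random.default_rng(0)
age = rng.normal(40, 12, 200)
df = pd.DataFrame({
    "age": age,
    "income": 1000 * age + rng.normal(0, 5000, 200),
    "spend": rng.exponential(300, 200),
})
metric_features = ["age", "income", "spend"]

# Pearson is an assumption here; method="spearman" is less sensitive to outliers.
corr = df[metric_features].corr(method="pearson")

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
```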

A tool to assist you through your exploratory data analysis

Optionally, you may use pandas-profiling as a first approach to your data analysis. Remember that although this tool provides excellent insights into the data you're working with, it is not enough on its own to perform a proper analysis.